Don't Decay the Learning Rate, Increase the Batch Size

Authors

  • Samuel L. Smith
  • Pieter-Jan Kindermans
  • Quoc V. Le
Abstract

It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate ε and scaling the batch size B ∝ ε. Finally, one can increase the momentum coefficient m and scale B ∝ 1/(1 − m), although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train Inception-ResNet-V2 on ImageNet to 77% validation accuracy in under 2500 parameter updates, efficiently utilizing training batches of 65536 images.
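
The schedule conversion described above maps directly onto an ordinary training loop: wherever an existing schedule would divide the learning rate by some factor at a given epoch, multiply the batch size by that same factor instead, keeping the learning rate and the number of epochs fixed. The sketch below illustrates this idea, assuming PyTorch; the model, the synthetic data, and the 5× decay at epochs 60/120/160 are illustrative placeholders, not values taken from the paper.

```python
# Minimal sketch (not the authors' code): instead of dividing the learning rate
# by `factor` at each milestone epoch, multiply the batch size by `factor`.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset (e.g. CIFAR-10 or ImageNet).
data = TensorDataset(torch.randn(2048, 32), torch.randint(0, 10, (2048,)))
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()

# Conventional schedule: lr = 0.1, decayed 5x at epochs 60/120/160.
# Repurposed schedule used here: keep lr fixed, grow the batch size 5x instead.
base_batch, factor, milestones = 128, 5, (60, 120, 160)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def batch_size_at(epoch: int) -> int:
    """Batch size implied by the repurposed decay schedule."""
    return base_batch * factor ** sum(epoch >= m for m in milestones)

for epoch in range(200):
    loader = DataLoader(data, batch_size=batch_size_at(epoch), shuffle=True)
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```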

Related Articles

An integrated vendor–buyer model with stochastic demand, lot-size dependent lead-time and learning in production

In this article, an imperfect vendor–buyer inventory system with stochastic demand, process quality control and learning in production is investigated. It is assumed that there is learning in production and investment in process quality improvement at the vendor's end, and a lot-size dependent lead-time at the buyer's end. The lead-time for the first batch and those for the rest of the batches ...

Scaling SGD Batch Size to 32K for ImageNet Training

The most natural way to speed up the training of large networks is to use data parallelism on multiple GPUs. To scale Stochastic Gradient (SG) based methods to more processors, one needs to increase the batch size to make full use of the computational power of each GPU. However, keeping the accuracy of the network as the batch size increases is not trivial. Currently, the state-of-the-art method is t...

Removing Noise in On-Line Search using Adaptive Batch Sizes

Stochastic (on-line) learning can be faster than batch learning. However, at late times, the learning rate must be annealed to remove the noise present in the stochastic weight updates. In this annealing phase, the convergence rate (in mean square) is at best proportional to 1/T, where T is the number of input presentations. An alternative is to increase the batch size to remove the noise. In th...

Material to "Dynamic Word Embeddings"

  • L = 10: vocabulary size
  • L′ = 103: batch size for smoothing
  • d = 100: embedding dimension for SoU and Twitter
  • d = 200: embedding dimension for Google books
  • N_tr = 5000: number of training steps for each t (filtering)
  • N′_tr = 5000: number of pretraining steps with minibatch sampling (smoothing; see Algorithm 2)
  • N_tr = 1000: number of training steps without minibatch sampling (smoothing; see Algorithm 2)
  • c_max = 4: cont...

Three Factors Influencing Minima in SGD

We study the properties of the endpoint of stochastic gradient descent (SGD). By approximating SGD as a stochastic differential equation (SDE) we consider the Boltzmann-Gibbs equilibrium distribution of that SDE under the assumption of isotropic variance in loss gradients. Through this analysis, we find that three factors – learning rate, batch size and the variance of the loss gradients – cont...
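
The link between this related analysis and the batch-size schedule above is the scale of the SGD noise. A hedged summary of the relation commonly used in this line of work (the notation here is assumed, not quoted from either paper): with learning rate ε, batch size B, and per-example gradient covariance C, the noise injected per unit of "time" scales roughly as εC/B, so raising B at fixed ε damps the noise in the same way as decaying ε at fixed B.

```latex
% Assumed notation (not quoted from the cited papers):
%   \epsilon : learning rate,  B : batch size,
%   C : covariance of the per-example loss gradients.
\[
  \theta_{t+1} = \theta_t - \epsilon\,\hat{g}_t,
  \qquad
  \operatorname{Cov}(\hat{g}_t) \approx \tfrac{1}{B}\,C,
\]
\[
  \operatorname{Cov}(\theta_{t+1} - \theta_t) \approx \frac{\epsilon^2}{B}\,C
  \;\;\Longrightarrow\;\;
  \text{noise per unit time} \;\propto\; \frac{\epsilon}{B}\,C .
\]
```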

Journal:
  • CoRR

Volume: abs/1711.00489

Publication year: 2017